It is shown that some recent results of Wong (1983) concerning his version
of the modified likelihood criterion for smoothing parameter selection
in kernel density estimation can be very misleading, because the wrong mode
of convergence is established. An example is given to demonstrate that the
results are false when a more reasonable mode of convergence is used.
Slightly stronger conditions are added and valid proofs for the correct
version of these results are indicated.


1.  Introduction


Let $X = \{X_1, \ldots, X_n\}$ be a random sample from a density $f$. The
kernel estimator of $f$ is

(1.1)  $\hat f_\lambda(x, X) = n^{-1} \sum_{i=1}^{n} K_\lambda(x - X_i),$

where

$K_\lambda(x) = \lambda^{-1} K(x / \lambda).$
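As a minimal sketch of the estimator (1.1), the following assumes the usual rescaling of the kernel by the bandwidth and uses the Epanechnikov kernel purely as an illustrative choice of K (the paper does not single out a kernel); the function names `epanechnikov` and `kde` are hypothetical.

```python
def epanechnikov(u):
    """Illustrative compactly supported kernel: K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde(x, sample, lam, kernel=epanechnikov):
    """Kernel estimate n^{-1} sum_i lam^{-1} K((x - X_i) / lam) at the point x."""
    n = len(sample)
    return sum(kernel((x - xi) / lam) for xi in sample) / (n * lam)


# Two observations, bandwidth 1: both contribute K(0.5) = 0.5625 at x = 0.5.
print(kde(0.5, [0.0, 1.0], 1.0))
```

Because this kernel vanishes outside [-1, 1], the estimate is exactly zero at any point farther than the bandwidth from every observation, which is the feature exploited by the counterexample of Section 2.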

The central problem in the field of density estimation is the choice of the
smoothing parameter or bandwidth, $\lambda$. As noted in Wong (1983), several
authors have proposed selecting $\lambda$ by maximizing the cross-validation
function

(1.2)  $\hat R^{CV}(\lambda) = n^{-1} \sum_{i=1}^{n} \log \hat f_\lambda(X_i, X_{(i)}),$

where $X_{(i)}$ denotes the "leave one out" sample:

$X_{(i)} = X \setminus \{X_i\}.$
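The cross-validation function can be sketched directly from its definition. This is an illustration under stated assumptions: the triangular kernel `tri` is a hypothetical choice, the leave-one-out estimate is normalized by the n - 1 remaining points, and a zero leave-one-out density is reported as minus infinity.

```python
import math


def tri(u):
    """Illustrative triangular kernel supported on [-1, 1]."""
    return max(0.0, 1.0 - abs(u))


def cv_score(sample, lam, kernel=tri):
    """Average log leave-one-out kernel density at the observations."""
    n = len(sample)
    total = 0.0
    for i, xi in enumerate(sample):
        # Kernel estimate at X_i built from the sample with X_i removed.
        s = sum(kernel((xi - xj) / lam) for j, xj in enumerate(sample) if j != i)
        dens = s / ((n - 1) * lam)
        if dens <= 0.0:
            return -math.inf  # log 0: one isolated point ruins the criterion
        total += math.log(dens)
    return total / n


print(cv_score([0.0, 0.5, 1.0], 1.0))   # finite
print(cv_score([0.0, 0.5, 10.0], 1.0))  # -inf: 10.0 has no neighbor within lam
```

The second call shows why compactly supported kernels are delicate here: a single observation with no neighbor inside the kernel window drives the whole criterion to minus infinity.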

Wong (1983) studies the behavior of (1.2) by looking for conditions
under which (letting $*$ denote convolution)

(1.3)  $\hat R^{CV}(\lambda) \to \int \log(f * K_\lambda)\,dF.$


This relationship is then used to establish an asymptotic equivalence of
(1.2) with a Jackknife selector of $\lambda$ given by

n 


RJack(X)  =  n^J^log  fx(XifX)-log  fx(Xi,X(.))4n_1 /^log  f^X^)] 


-1 


In  the  present  paper  it  will  be  shown  that  Wong's  results  need  very  cautious 
interpretation  for  two  reasons. 


First, the proof of the pivotal Theorem 1 in Wong (1983) is based on
a Law of Large Numbers for a sequence of independent, identically distributed
random variables in Banach space, which forces the smoothing parameter,
$\lambda$, to be fixed as $n \to \infty$. Allowing $\lambda$ to vary with $n$
necessitates consideration of arrays of random variables in Banach spaces
where each row is an i.i.d. sequence (see Taylor (1982) and Taylor (1983) for
results of this type). Asymptotics with $\lambda$ fixed provide a poor model
for studying kernel density estimation because it is well known that, as
$n \to \infty$, one needs $\lambda \to 0$ to even have consistency of
$\hat f_\lambda$ (i.e., convergence to $f$). In Section 2, an example is
presented which demonstrates that this issue is vital to understanding when
(1.3) holds and is not a minor technical detail.

Second, the theorems of Wong (1983) which study the asymptotic behavior
of the functions $\hat R^{CV}(\lambda)$ and $\hat R^{Jack}(\lambda)$ are
established only pointwise (in $\lambda$). But what is really of interest
here is the properties of the maximizers of these functions, and such
inferences require theorems which are uniform in $\lambda$. The example of
Section 2 demonstrates that uniformity over all $\lambda > 0$ is impossible.
However, in Section 3 it is shown that, under stronger assumptions, (1.3) and
the asymptotic equivalence of $\hat R^{CV}(\lambda)$ and
$\hat R^{Jack}(\lambda)$ are true uniformly over a very reasonable range of
$\lambda$.

2.  Counterexample

In Marron (1984) it is seen that if the cumulative distribution function
of $f$ is

$F(x) = e^{-1/x}$  for $x > 0$,

if $K$ is compactly supported, and if $\lambda_n$ tends to 0 fast enough that

(2.1)  ...

then

(2.2)  $\hat R^{CV}(\lambda_n) \to -\infty$  in probability.

To see the implications of this for the maximizer of $\hat R^{CV}$, note
that the usual (see, for example, Rosenblatt (1971)) optimal bandwidth of
$\lambda_n \sim n^{-1/5}$ easily satisfies (2.1). So the maximizer of
$\hat R^{CV}$ will be (asymptotically) very far from optimal, or in other
words cross-validation fails in this setting. The maximizer of
$\hat R^{Jack}$ can be shown to suffer similar difficulties.
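A small numerical illustration of this failure, not the argument of Marron (1984): the sketch below assumes the distribution function is F(x) = exp(-1/x) for x > 0, uses its evenly spaced quantiles as a deterministic stand-in for a random sample, and takes the Epanechnikov kernel as a hypothetical compactly supported K with bandwidth n^(-1/5). The heavy right tail leaves the largest point isolated, so the cross-validation score is minus infinity.

```python
import math


def epanechnikov(u):
    """Illustrative compactly supported kernel on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def cv_score(sample, lam, kernel):
    """Average log leave-one-out kernel density; -inf if any density is 0."""
    n = len(sample)
    total = 0.0
    for i, xi in enumerate(sample):
        s = sum(kernel((xi - xj) / lam) for j, xj in enumerate(sample) if j != i)
        dens = s / ((n - 1) * lam)
        if dens <= 0.0:
            return -math.inf
        total += math.log(dens)
    return total / n


def quantile(u):
    """Inverse of the assumed distribution function F(x) = exp(-1/x), x > 0."""
    return -1.0 / math.log(u)


n = 200
sample = [quantile((i + 0.5) / n) for i in range(n)]  # increasing in i
lam = n ** (-0.2)                                     # bandwidth of order n^(-1/5)
gap = sample[-1] - sample[-2]                         # distance between two largest points
score = cv_score(sample, lam, epanechnikov)
print(lam, gap, score)
```

The gap between the two largest points is several hundred times the bandwidth, so no rescaling of a compact kernel at this rate can cover it.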

To relate (2.2) to the results of Wong (1983), note that if $K$ is a
probability density which is bounded and positive on a neighborhood of the
origin, then an easy analytic argument shows that there is a constant $M > 0$
so that

$-M < \int \log(f * K_\lambda)\,dF < M,$

for $\lambda \in (0,1)$. Thus

$\hat R^{CV}(\lambda_n) - \int \log(f * K_{\lambda_n})\,dF \to -\infty$

in probability and so (1.3) no longer holds. This does not contradict
Wong's Theorem 1 because there (1.3) is established pointwise in
$\lambda > 0$, but it does show one cannot have (1.3) uniformly over
$\lambda > 0$ or even for $\lambda \to 0$ (as with the optimal
$\lambda_n \sim n^{-1/5}$), under these assumptions. Hence, the issues raised
in Section 1 are crucial to understanding the behavior of $\hat R^{CV}$ and
$\hat R^{Jack}$, and not just technical details.

3.  Positive  Results 

The pathology of the previous section is caused by the fact that
$f(x)$ is "close to 0 on a set of large measure." To avoid this (and thus
make it possible for the maximizer of $\hat R^{CV}$ or $\hat R^{Jack}$ to have
some optimality properties) it will be assumed that $f$ is supported and
bounded away from 0 on an interval $[a,b]$. Gasser and Müller (1979) (working
in the very similar setting of regression estimation) indicated that the
estimator (1.1) does not perform well near $a$ and $b$. To avoid technical
details which would obscure the issues under discussion here, the "boundary
effect" will be ignored. In Marron (1984) it is seen how to modify
$\hat R^{CV}$ to overcome this difficulty.

Two technical points in the proof of Theorem 1 of Wong (1983) relate
to the Banach space of bounded continuous functions, $C(-\infty,\infty)$.
First, continuity of the estimator may be a problem without additional
conditions on $K$ (for example, the shifted histogram of Rosenblatt (1956) is
discontinuous). However, the choice of $K$ is of secondary importance in the
kernel estimate, and most selections of $K$ are continuous. Here it will be
assumed that $K$ is a probability density which is positive at the origin and
is Lipschitz continuous in the sense that there are positive constants $C_1$
and $\alpha$ so that

(3.1)  $|K(x) - K(y)| \le C_1 |x - y|^{\alpha},$

for all $x, y$. Most commonly used kernel functions satisfy these assumptions.
The second technical point relates to separability, which is required by the
referenced Strong Law of Large Numbers of Revesz (1968), but which does not
hold for $C(-\infty,\infty)$. However, since the range of the i.i.d. random
variables $X_1, X_2, \ldots$ (namely $\mathbb{R}$) is separable, condition
(3.1) assures that the kernel estimators reside in a separable subspace of
$C(-\infty,\infty)$.
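To make condition (3.1) concrete, the following sketch checks it numerically for one familiar kernel. The Epanechnikov kernel and the values alpha = 1 and C_1 = 1.5 are assumptions chosen for illustration (1.5 is the largest slope of this kernel, attained at the edge of its support); `holder_ratio` is a hypothetical helper.

```python
def epanechnikov(u):
    """Illustrative kernel K(u) = 0.75 (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def holder_ratio(kernel, alpha, grid):
    """Largest value of |K(x) - K(y)| / |x - y|^alpha over distinct grid points."""
    worst = 0.0
    for x in grid:
        for y in grid:
            if x != y:
                ratio = abs(kernel(x) - kernel(y)) / abs(x - y) ** alpha
                worst = max(worst, ratio)
    return worst


grid = [-1.5 + 0.01 * k for k in range(301)]  # covers the support and beyond
print(holder_ratio(epanechnikov, 1.0, grid))  # stays below C_1 = 1.5
```

A grid check of this kind cannot prove the inequality, but it is a quick way to see that a candidate pair of constants is plausible before relying on it.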

To see the range of $\lambda$'s to be considered, let $\beta$ be a small
positive constant and define


Note that for $\beta$ sufficiently small,


Theorem 1:

$\sup_{\lambda \in [\underline{\lambda}, \bar{\lambda}]} \left| \hat R^{CV}(\lambda) - \int \log(f * K_\lambda)\,dF \right| \to 0$  a.s.


Theorem 2:  Under the above assumptions,

$\sup_{\lambda \in [\underline{\lambda}, \bar{\lambda}]} \left| \hat R^{CV}(\lambda) - \hat R^{Jack}(\lambda) \right| \to 0$  a.s.


The proofs of these theorems follow the general outline of that of
Wong (1983) but are unfortunately quite technical in nature, in part because
of the "uniform over $\lambda$" improvement. The details which establish the
present version of (9) in Wong's argument are in the appendix.
Appendix:  Proof of Theorems

Following Wong (1983), the key step is to prove

(A.1)  $\sup_{x \in [a,b],\ \lambda \in [\underline{\lambda}, \bar{\lambda}]} \left| \frac{\hat f_\lambda(x, X)}{f * K_\lambda(x)} - 1 \right| \to 0$  a.s.


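The consideration of the random variables in a separable subspace of
$C(-\infty,\infty)$ with the sup norm neatly answers measurability questions
concerning the sequence of random variables in (A.1). Consider only
$n > n_0$ so that

$f * K_\lambda(x) \ge C_2 > 0$

for all $x \in [a,b]$ and all $\lambda \in [\underline{\lambda}, \bar{\lambda}]$.
Choose partitions $x_1, \ldots, x_{J_n} \in [a,b]$ and
$\lambda_1, \ldots, \lambda_{L_n} \in [\underline{\lambda}, \bar{\lambda}]$ so that

$x_1 = a,\quad x_{J_n} = b,\quad \lambda_1 = \underline{\lambda},\quad \lambda_{L_n} = \bar{\lambda},$

and so that there is a constant $C_3$ such that

(A.2)  $J_n \le C_3 n^{1+1/\alpha}$,  $L_n \le$ ,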


and so that, for $j = 2, \ldots, J_n$, $\ell = 2, \ldots, L_n$,

(A.3)  $|x_j - x_{j-1}| \le n^{-(1+1/\alpha)},$


^-l^1 


(O'1  $  n"1/2a  * 


where $\alpha$ is the constant in (3.1). From here on, the dependence of the
above quantities on $n$ will be suppressed for notational convenience.
Moreover, $C_1, \ldots, C_{12}$ will refer to constants (independent of $n$).
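The partition of [a,b] can be sketched as follows. This illustrates only the x-grid, assuming a mesh requirement of n^(-(1+1/alpha)) between neighboring points; the helper name `x_partition` and the equally spaced layout are assumptions, since the construction only needs some grid with that mesh and endpoints a and b.

```python
import math


def x_partition(a, b, n, alpha):
    """Equally spaced grid on [a, b] with spacing at most n^(-(1 + 1/alpha))."""
    mesh = n ** (-(1.0 + 1.0 / alpha))
    cells = math.ceil((b - a) / mesh)   # enough cells to respect the mesh bound
    step = (b - a) / cells              # actual spacing, no larger than mesh
    return [a + k * step for k in range(cells)] + [b]


grid = x_partition(0.0, 1.0, 10, 1.0)
print(len(grid), grid[1] - grid[0])
```

With this layout the number of grid points is at most (b - a) n^(1+1/alpha) + 2, so a bound on the grid size of order n^(1+1/alpha) holds with the constant b - a + 2.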


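Expand (A.1) into

(A.4)  $\sup_{x \in [a,b],\ \lambda \in [\underline{\lambda}, \bar{\lambda}]} \left| \frac{\hat f_\lambda(x, X)}{f * K_\lambda(x)} - 1 \right|$

  $\le \max_{j=1,\ldots,J;\ \ell=1,\ldots,L} \left| \frac{\hat f_{\lambda_\ell}(x_j, X)}{f * K_{\lambda_\ell}(x_j)} - 1 \right|$

  $+ \max_{j=2,\ldots,J;\ \ell=2,\ldots,L}\ \sup_{x \in [x_{j-1}, x_j],\ \lambda \in [\lambda_{\ell-1}, \lambda_\ell]} \left| \frac{\hat f_\lambda(x, X)}{f * K_\lambda(x)} - \frac{\hat f_{\lambda_\ell}(x_j, X)}{f * K_{\lambda_\ell}(x_j)} \right|.$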


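The second term on the right hand side of (A.4) can be bounded by writing, for
$j = 2, \ldots, J$, $\ell = 2, \ldots, L$, $x \in [x_{j-1}, x_j]$,
$\lambda \in [\lambda_{\ell-1}, \lambda_\ell]$,

$\left| \frac{\hat f_\lambda(x, X)}{f * K_\lambda(x)} - \frac{\hat f_{\lambda_\ell}(x_j, X)}{f * K_{\lambda_\ell}(x_j)} \right| \le n^{-1} \sum_{i=1}^{n} \left| \frac{K_\lambda(x - X_i)}{f * K_\lambda(x)} - \frac{K_{\lambda_\ell}(x_j - X_i)}{f * K_{\lambda_\ell}(x_j)} \right|.$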

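But, for $i = 1, \ldots, n$,

(A.5)  $\left| \frac{K_\lambda(x - X_i)}{f * K_\lambda(x)} - \frac{K_{\lambda_\ell}(x_j - X_i)}{f * K_{\lambda_\ell}(x_j)} \right|$

  $\le |K_\lambda(x - X_i)| \, \frac{\left| f * K_{\lambda_\ell}(x_j) - f * K_\lambda(x) \right|}{f * K_\lambda(x) \, f * K_{\lambda_\ell}(x_j)} + \frac{\left| K_\lambda(x - X_i) - K_{\lambda_\ell}(x_j - X_i) \right|}{f * K_{\lambda_\ell}(x_j)}.$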
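A bound of this kind rests on an elementary identity for a difference of two ratios. With the shorthand $a = K_\lambda(x - X_i)$, $b = f * K_\lambda(x)$, $c = K_{\lambda_\ell}(x_j - X_i)$ and $d = f * K_{\lambda_\ell}(x_j)$ (letters introduced here only for illustration), the step can be written out as:

```latex
\frac{a}{b} - \frac{c}{d}
  = \frac{a\,(d - b)}{b\,d} + \frac{a - c}{d},
\qquad\text{so that}\qquad
\left| \frac{a}{b} - \frac{c}{d} \right|
  \le |a| \, \frac{|d - b|}{b\,d} + \frac{|a - c|}{d}
\qquad (b,\, d > 0).
```

The first term isolates the change in the denominators and the second the change in the numerators, which is why the two are bounded separately in what follows.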

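But, since $K$ is bounded and $f$ is bounded away from 0 on $[a,b]$,

$|K_\lambda(x - X_i)| \, \frac{\left| f * K_{\lambda_\ell}(x_j) - f * K_\lambda(x) \right|}{f * K_\lambda(x) \, f * K_{\lambda_\ell}(x_j)} \le C_4 \lambda^{-1} \left| f * K_{\lambda_\ell}(x_j) - f * K_\lambda(x) \right|$

  $\le C_4 \left[ \lambda^{-1} \left| f * (K_{\lambda_\ell}(x_j) - K_{\lambda_\ell}(x)) \right| + \lambda^{-1} \left| f * (K_{\lambda_\ell}(x) - K_\lambda(x)) \right| \right].$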


And  by  integration  by  substitution,  (3.1),  (3.2)  and  (A. 3) 


,-1 


C*(KX«,(Xj^‘KXS  (X)) 


<  X 


-1 


/(r-  K(^ri)  '  K(^))f(y)Jy 

\  xz  \  xz 


<  X"1/ 


x-x . 

K(u+  ^ J  -  K(u) 


f(x+X.u)du  < 


<  C^^lx-x.^X,-1^  <  C5n^n_1^ni+a/2  =  C^'3'^2. 


In  a  similar  spirit,  using  also  the  boundedness  and  compact  support  of  f, 


X_1|£*(Ku(x)-Kx(x))|  =  x'1|/^-K(^)-iK(^)]f(y)dy 

’  9*  Ss 


<  X  V  K(u)  -  —■  K (-y  u)  f(x+X£u)du  < 


<  X  *  1  -  -y  /K(u)f(x+X£  u  )du+  -y  f  K(u)-K(-^  u)  f(x+X£u)du  < 


,  r  -l/2a  .  r  -  l  <  r  n'1/2 

<  C^n  +  C1  1  — T  -  /n 


It follows from the above that the first term on the right hand side of
(A.5) tends to 0. To check the second term, note that since $f$ is bounded
away from 0,

K  (x-X-)-K,  (x,-X.)  [  ,  x-X.  x.-X- 

£*W  X  X£  X  X„  X  x£ 


But,  by  the  boundedness  of  K  and  (A. 3), 


1  -  X'1  KH-1)  5  C,n1/2a 


And  by  the  compactness  of  the  support  of  f  and  (3.1), 


.  X  -X.  X.-X-  I  1  n  [  I  i  . 

”  1  i '  r  i->  l-/-J  i  a  i  ^  n  _  2  ~  ft  \  “  1  i  “  1 


V  K(~irl)  •  *<+-4  i  V*  »  -  V  \  *-*j 

ic 


itf13  in-W-v4  • 


It follows from the above that the second term on the right hand side of
(A.4) tends to 0.

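To check the almost sure convergence of the first term of (A.4),
for $\varepsilon > 0$ consider

(A.6)  $P\left[ \max_{j,\ell} \left| \frac{\hat f_{\lambda_\ell}(x_j, X)}{f * K_{\lambda_\ell}(x_j)} - 1 \right| > \varepsilon \right] \le J_n \cdot L_n \cdot \max_{j,\ell} P\left[ \left| \hat f_{\lambda_\ell}(x_j, X) - f * K_{\lambda_\ell}(x_j) \right| > \varepsilon C_2 \right].$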


Next pick $q$, an even positive integer, so that

(A.7)  $2 + \frac{1}{\alpha} + 2\alpha - \beta q < -1.$
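A short sketch of how such a q can be chosen. It assumes the condition reads 2 + 1/alpha + 2 alpha - beta q < -1, which is one reading of the source and should be checked against the original; the helper names `exponent` and `smallest_even_q` are hypothetical.

```python
import math


def exponent(alpha, beta, q):
    """Exponent of n under the assumed form of the condition on q."""
    return 2.0 + 1.0 / alpha + 2.0 * alpha - beta * q


def smallest_even_q(alpha, beta):
    """Smallest even integer q with exponent(alpha, beta, q) < -1."""
    bound = (3.0 + 1.0 / alpha + 2.0 * alpha) / beta  # need q > bound
    return 2 * (math.floor(bound / 2.0) + 1)


q = smallest_even_q(1.0, 0.25)
print(q, exponent(1.0, 0.25, q))
```

For alpha = 1 and beta = 0.25 the requirement is q > 24, so the smallest admissible even integer is 26; smaller values of beta force a larger moment order q.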


Then for each $j$ and $\ell$, using a Marcinkiewicz-Zygmund inequality (see,
for example, Theorem 2 of Section 10.3 of Chow and Teicher (1978)),


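(A.8)  $P\left[ \left| \hat f_{\lambda_\ell}(x_j, X) - f * K_{\lambda_\ell}(x_j) \right| > \varepsilon C_2 \right]$

  $= P\left[ \left| (n\lambda_\ell)^{-1} \sum_{i=1}^{n} \left( K\left(\frac{x_j - X_i}{\lambda_\ell}\right) - E K\left(\frac{x_j - X_i}{\lambda_\ell}\right) \right) \right| > \varepsilon C_2 \right]$

  $\le E \left| (n\lambda_\ell)^{-1} \sum_{i=1}^{n} \left( K\left(\frac{x_j - X_i}{\lambda_\ell}\right) - E K\left(\frac{x_j - X_i}{\lambda_\ell}\right) \right) \right|^{q} \Big/ (\varepsilon C_2)^{q}$

  $\le C_{10} (\varepsilon C_2)^{-q} (n\lambda_\ell)^{-q} \, E \left( \sum_{i=1}^{n} \left( K\left(\frac{x_j - X_i}{\lambda_\ell}\right) - E K\left(\frac{x_j - X_i}{\lambda_\ell}\right) \right)^{2} \right)^{q/2}$

  $\le C_{10} (\varepsilon C_2)^{-q} (n\lambda_\ell)^{-q} \, n^{q/2} \, E \left( K\left(\frac{x_j - X_1}{\lambda_\ell}\right) - E K\left(\frac{x_j - X_1}{\lambda_\ell}\right) \right)^{q}$

  $\le C_{11} \, n^{-q} \, n^{q/2} (\lambda_\ell)^{-q} \le C_{12} (n \lambda_\ell^{2})^{-q/2}$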


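for each $n$. From (A.6), (A.8), (A.2) and (3.2),

$\sum_{n=1}^{\infty} P\left[ \max_{j,\ell} \left| \frac{\hat f_{\lambda_\ell}(x_j, X)}{f * K_{\lambda_\ell}(x_j)} - 1 \right| > \varepsilon \right] \le n_0 + C_3^{2} C_{12} \sum_{n=n_0+1}^{\infty} n^{2 + \frac{1}{\alpha} + 2\alpha - \beta q} < \infty,$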


by (A.7). Hence,

$\max_{j,\ell} \left| \frac{\hat f_{\lambda_\ell}(x_j, X)}{f * K_{\lambda_\ell}(x_j)} - 1 \right|$

converges completely and consequently a.s. to 0.
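The passage from complete convergence to almost sure convergence is the Borel-Cantelli lemma. Writing $Z_n$ (a label introduced here, not in the original) for the maximum above at sample size $n$, the step can be spelled out as:

```latex
\sum_{n=1}^{\infty} P\bigl[\, |Z_n| > \varepsilon \,\bigr] < \infty
\ \ \text{for every } \varepsilon > 0
\;\Longrightarrow\;
P\bigl[\, |Z_n| > \varepsilon \ \text{infinitely often} \,\bigr] = 0
\ \ \text{for every } \varepsilon > 0
\;\Longrightarrow\;
Z_n \to 0 \quad \text{a.s.}
```

No independence across $n$ is needed for this direction of the lemma, which is why summability of the tail probabilities alone suffices.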

The rest of the proofs follow as in Wong (1983), except that the step

$\sup_{\lambda} \left| n^{-1} \sum_{i=1}^{n} \log \frac{\hat f_\lambda(X_i, X_{(i)})}{f * K_\lambda(X_i)} \right| \to 0$  a.s.


takes more work to verify. However, since $f$ and $f * K_\lambda$ are bounded
away from 0, this may be easily done using an argument which is similar to
(but simpler than) the preceding arguments.